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Abstract 

Background: Combining different sources of knowledge to build improved structure activity relationship models is 
not easy owing to the variety of knowledge formats and the absence of a common framework to interoperate 
between learning techniques. Most of the current approaches address this problem by using consensus models 
that operate at the prediction level. We explore the possibility to directly combine these sources at the knowledge 
level, with the aim to harvest potentially increased synergy at an earlier stage. Our goal is to design a general 
methodology to facilitate knowledge discovery and produce accurate and interpretable models. 

Results: To combine models at the knowledge level, we propose to decouple the learning phase from the 
knowledge application phase using a pivot representation (lingua franca) based on the concept of hypothesis. A 
hypothesis is a simple and interpretable knowledge unit. Regardless of its origin, knowledge is broken down into a 
collection of hypotheses. These hypotheses are subsequently organised into hierarchical network. This unification 
permits to combine different sources of knowledge into a common formalised framework. The approach allows us 
to create a synergistic system between different forms of knowledge and new algorithms can be applied to 
leverage this unified model. This first article focuses on the general principle of the Self Organising Hypothesis 
Network (SOHN) approach in the context of binary classification problems along with an illustrative application to 
the prediction of mutagenicity. 

Conclusion: It is possible to represent knowledge in the unified form of a hypothesis network allowing 
interpretable predictions with performances comparable to mainstream machine learning techniques. This new 
approach offers the potential to combine knowledge from different sources into a common framework in which 
high level reasoning and meta-learning can be applied; these latter perspectives will be explored in future work. 

Keywords: Machine learning, Knowledge discovery, Data mining, SAR, QSAR, SOHN, Interpretable model, 
Confidence metric, Hypothesis Network 



Background 

Developing robust Structure Activity Relationship (SAR) 
models for a given endpoint requires a good understanding 
of the interaction between the chemical compound 
and the targeted biological system. This knowledge can 
be extracted from experimental observations compiled in 
the form of datasets. Thorough analysis of these datasets, 
either manually by expert scientists (Expert Learning) or 
automatically by computer algorithms (Machine Learning) 
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[1], can provide a better understanding of the structure 
activity relationships. The resulting knowledge can be 
captured in the form of SAR rules, and implemented in 
expert systems, or lead to quantitative statistical models 
(QSAR). In both cases the underlying knowledge be- 
comes a central asset for further exploring the end- 
points mechanism or predicting the behaviour of unseen 
compounds. 

Depending on the learning method used, the resulting 
knowledge will present different qualities associated with 
various use cases. Although the accuracy of SAR and QSAR 
models has for a long time been the primary concern and 
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used as the main performance measure [2], there are other 
important facets amongst which are: 

The transparency of the model and the interpretability 
of the predictions they provide [3-5] 
The ability to estimate the confidence in an individual 
prediction [6,7] 

The possibility to define an applicability domain of 
the model [8-12] 

The proportion of structures for which an actual 
classification can be made (coverage) 
The ability to provide supporting evidence (e.g. real 
world examples) 

All these properties have gained importance recently 
especially in the context of risk assessment where they 
are partially covered by the Organisation for Economic 
Cooperation and Development (OECD) guidelines [13]. 

Addressing all of these aspects in parallel is particularly 
challenging and as a result it is difficult to design the 
perfect model that resolves all of them simultaneously. 
Fortunately the end users can combine their own expertise 
with different tools and sources of knowledge (expert 
systems, statistical models, databases, etc.) in order to 
build a well informed decision (Figure 1). 

Whilst there is a large choice of such tools, it is difficult to 
build a framework to combine their knowledge and help le- 
verage the potential synergy between different models. For 
instance interoperability between different approaches is 
challenging due to a lack of unified knowledge representa- 
tion and there is no standard way to provide overall decision 
support. The main reason for this is that we are lacking a 
common language that would help us take advantage of dif- 
ferent forms of knowledge within a common framework. 

A common practice is to combine the predictions of 
several models into an overall outcome using consensus 



based reasoning. This approach has sometimes proven 
to provide better performance than the separate individ- 
ual models [14]. In this consensus method, models are 
combined at a late stage i.e. the prediction phase. It is 
legitimate to investigate if there is room for further im- 
provement by merging these models at an earlier stage, 
i.e. directly at the knowledge level and in so doing, lever- 
age any form of synergy prior to predicting (Figure 2). 
Furthermore, if we manage to combine sources of know- 
ledge rather than final predictions we can develop a 
common library of algorithms that can be applied to this 
higher level abstraction. With this question in mind we 
initiated a project that aims to build a common know- 
ledge platform along with algorithms to facilitate know- 
ledge discovery and produce interpretable and accurate 
predictions. We present here our first step towards this 
goal in the form of a novel way to abstract and organize 
knowledge which will become the cornerstone of this 
new approach. 

Method 

The proposed approach is based on 3 key steps. First, 
derive knowledge from different sources of information 
(experimental data, expert learning, machine learning). 
Secondly, unify the different knowledge representations. 
Finally, organise the unified knowledge in a way that 
captures its generalisation hierarchy and facilitates the 
design of efficient prediction algorithms (Figure 3). 

Different learning methods will usually use different 
knowledge formats which make the knowledge merging 
process difficult. One possible solution is to define a 
common representation ('lingua franca ) in order to 
combine the knowledge produced by various learning 
techniques. We also want the final global model to be 
transparent and thus the common representation must 
be composed of interpretable elements of knowledge. 
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Figure 1 Decision supporting tools. End users often rely on various combinations of tools and knowledge representations to build a decision. 
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Figure 2 Consensus model vs. Unified knowledge. Two ways of combining different learning methods; the consensus approach by merging 
the prediction (late stage) and the unified knowledge approach by combining the knowledge (early stage). 



For that purpose we introduce the concept of hypoth- 
esis: a simple and interpretable knowledge unit. In the 
first step we collect knowledge relevant to the target 
endpoint using several sources and we break down this 
knowledge into a collection of hypotheses. Once the 
hypotheses have been generated we can organise them 
into a hierarchical structure that captures the different 
level of generalisation of the knowledge. This structure 
is automatically updated each time a new hypothesis is 
inserted following a specific algorithm that ensures the 
overall consistency of the knowledge. The structure is 
called a Self-Organising Hypothesis Network or SOHN. 

Finally, powerful algorithms can take advantage of the 
SOHN for analysing the underlying knowledge and 
perform accurate and transparent predictions. These 
algorithms no longer depend on the initial specific 
learning method or knowledge representation; hence 
they can be mutualised across the different mining 
techniques. In this higher level of processing, the unified 
and structured knowledge representation can be fully 
leveraged. 

Knowledge unification 

The central concept of our approach is the hypothesis. 
A hypothesis defines the scope of a class of compounds 
sharing a SAR trend for a given endpoint. From a 



classification viewpoint, a hypothesis defines a class of com- 
pounds in the chemical space that fully or partially sepa- 
rates the endpoint classes. A hypothesis can be seen as a 
local model representing a knowledge unit. 

In practice, hypotheses can take different forms de- 
pending on the molecular information that they take 
into account (Figure 4). For instance structural hypoth- 
eses can represent a class of compounds containing a 
given functional group or more generally a structural 
pattern. Other hypotheses may be based on physico- 
chemical or pharmacophoric properties of the structure. 
There is no restriction on how the scope of hypotheses 
may be defined, for instance, we can even define a hypoth- 
esis that matches all the structures that are similar to 
nitro-benzene with a Tanimoto index greater than 0.8. 
Hypotheses are simply an abstract way to express SAR 
assumptions. 

Although there are no predefined rules regarding their 
scope, all hypothesis types must provide a minimum of 
functionalities, i.e. fulfil a contract, in order to build our 
unified knowledge framework. The contract for a given 
type of hypothesis contains only three clauses. 

Instance matching 

We must be able to check if a hypothesis covers (i.e. 
subsumes) a given structure. We say that a hypothesis 
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Figure 3 The SOHN methodology. Different sources of knowledge are unified using a common representation based on the concept of 
hypothesis. The hypotheses can be organised into a hierarchical network to capture the knowledge in a standardised way. 
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Figure 4 Different types of hypotheses. Hypotheses can capture 
different attributes of a compound relevant to the target endpoint. 
The SOHN abstraction can handle different types of hypotheses 
within the same SOHN. 



h covers a structure s if s belongs to set of structures 
that h matches. The latter is also known as the scope 
of h and this relationship is noted: s e Scope{h). 

For instance in the case of a structural hypothesis this 
clause can be implemented using the presence or absence 
of a structural fragment. 



Domain matching 

We need to know if a given compound belongs to the 
applicability domain of a given type of hypothesis. The 
applicability domain of a hypothesis h is the set of struc- 
tures for which the hypothesis can be reliably evaluated. 
This relationship is noted: s e Domain{h). 

For instance for structural hypotheses, a reference 
dictionary of common moieties in the target chemical 
space can be built and this clause could state that a 
structure belongs to the applicability domain if all its 
atoms and bonds are covered by at least one fragment 
from this dictionary. 



Similarity measure 

Each type of hypotheses must provide a way to compute 
a symmetrical and normalised index of similarity (ranging 
from 0 to 1) between two instances of the input space. 
The similarity index between two structures s x and s 2 
given a hypothesis h is noted: Similarity^ s lf s 2 )« 

In the case of structural hypotheses it is, for instance, 
possible to use a fingerprint based Tanimoto index to 
implement this clause. 



These simple functionalities are all that is required in 
order to define a new type of hypothesis. 

Once a hypothesis type has been designed and its 
contract implemented we can derive two fundamental 
properties of a given hypothesis. 

Coverage 

Given a reference dataset D (set of instances) the coverage 
C of a hypothesis h is the subset of instances for which 
the hypothesis applies (intersection between the scope of 
h and the dataset). 

C(hfi) = {s\seD,s£Scope(h)} 



Signal 

Given a reference dataset D (collection of instances) the 
signal of a hypothesis h for a given class (or label) y is 
defined as follows: 



SyfcD) 



L x n y (h,D)-N(h,D) 



(L-l) xN(h,D) 
n y (h,D) = | {5 e C(/z,Z));sej/}| = number of instances labelled y 

N(h,D) = ^^n y (h,D) = \C(h,D)\ = number of instances covered by h 

yeY 

L = \Y\ 

where Y is the set of classes or labels for the target 
endpoint. 

The signal indicates the deviation in the distribution of 
the classes (compared to a balanced distribution n) 
amongst the covered examples. A signal value of 0 for a 
class y indicates that the hypothesis does not discrimin- 
ate this class from other classes. A signal value of 1 
means that the hypothesis covers only examples from 
the class y and is therefore fully discriminating. 

From a learning viewpoint, a good hypothesis is a one 
that combines a strong signal and a wide coverage. Such 
a hypothesis represents an interesting concept related to 
the target endpoint. 

We can now consider different sources providing ele- 
ments of knowledge in the form of simple hypotheses 
(Figure 5). For instance, human experts can contribute 
knowledge in the form of structural alerts. Automated 
machine learning can also be applied in the knowledge 
extraction phase with transparent models providing 
interpretable rules that can be directly transcribed 
into a set of corresponding hypotheses. This applies 
to machine learning techniques like Inductive Logic 
Programming [15] and Decision Trees [16] (DT). In 
the case of less transparent models, like Support Vector 
Machines [17] (SVM) or Random Forests [18] (RF) etc., it 
is possible in some cases to apply automatic rule extrac- 
tion techniques and transform these identified rules into 
hypotheses. Finally experimental data represent factual 
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Figure 5 Combining different source of knowledge. Different sources of knowledge can be merged into a structured and unified model. This 
permits to combine different learning techniques (expert learning, machine learning and facts). 



knowledge and each observation becomes a very specific 
hypothesis (only one instance covered and a signal value 
equal to 1). Example based hypotheses usually provide 
a very poor level of generalisation but they represent 
experimental evidence which makes them extremely 
valuable. 

Knowledge organisation 

Once hypotheses have been gathered from one or 
several sources they can be organised into a useful 
structured knowledge representation. Hypotheses con- 
vey very local models for a given endpoint, these units 
of knowledge can be organised into a meaningful hier- 
archy. From the knowledge mining perspective the 
most important relationship between hypotheses is 



their degree of generalisation. For instance, in Figure 6, 
hypothesis A is more general than hypothesis B since 
all the structures covered by B are necessarily covered 
by A. Similarly the logP based hypothesis C is more 
general then D. 

In both cases it is easy to understand how to define 
the order since we compare hypotheses of the same 
types. We intuitively perform a substructure compari- 
son between A and B and find A included in B and 
thus can state that it is more general. We can also per- 
form a logP range comparison between C and D and 
infer that C is more general than D because its logP 
scope includes the range of values allowed in D. How- 
ever when we try to compare hypotheses of different 
types, the ranking becomes more difficult and less 
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Figure 6 Ordering hypotheses according to their degree of generalisation. Hypotheses can be ordered according to their level of 
generalisation. Here, hypotheses A is more general than hypothesis B (A is a substructure of B). Similarly C is more general than D. 
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Figure 7 Difficulty in ordering hypotheses of different types. Comparing the level of generalisation between hypotheses of different types is 
not trivial and in some cases impossible when the underlying molecule attributes are not directly comparable. 



intuitive. For instance in Figure 7, which hypothesis is 
the more general? 

It is not possible to answer the question by simply 
comparing the two hypotheses. Instead we need a way to 
define generalisation independently from the hypothesis 
type. In our approach we propose to use a reference 
dataset so we can define a generalisation order based 
on the coverage of the hypotheses within this dataset. 
The method is illustrated in Figure 8 with a reference 
dataset containing 10 examples (e! to e 10 ). Let us as- 
sume that a hypothesis h 2 covers examples e! to e 5 and 
a second hypothesis h 2 applies to the examples e lf e 2 
and e 3 . We can now intuitively infer that h 2 is more 
general than h 2 since all the examples covered by h 2 
are also covered by hi. Here we simply use the first 
clause (matching) of the hypothesis contract given a 
reference dataset D. 

hi > h 2 iff VeeScope(h 2 )=>eeScope(hi). o C{h 2 ,D) cC{h u D) 
(where > means'more general than') 

The definition implies that if two hypotheses have identi- 
cal coverage, they do not have a parent/child relationship. 



Note that there are two extreme cases of hypotheses. 
First the whole chemical space can be seen as the result- 
ing coverage of a hypothesis that matches any structure 
and therefore represents the most general hypothesis 
possible; we call it the root hypothesis h 0 . On the other 
end of the spectrum, each example is a form of an ultim- 
ately specific hypothesis that covers only itself. Thus in 
terms of generalisation, in our illustration h 0 > h 2 > h 2 > e 2 . 
Pairs of hypotheses that do not have subset/superset 
relationships are said to be 1 incomparable' (e.g. h 2 and 
h 4 in Figure 9). 

Having defined a universal way to compare the hy- 
potheses according to their degree of generalisation, 
we can now organise them into a hierarchical network 
from the most general (the root) to the most specific 
(the reference dataset examples) creating a top-down 
generalisation axis and a bottom-up instantiation axis 
(Figure 9). 

We have developed an algorithm that automatically 
updates the network when a new hypothesis is inserted 
(or removed) and we call the resulting data structure a 
Self- Organising Hypothesis Network (SOHN). 




Reference dataset 
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Figure 8 A more universal way to order hypotheses. Using a reference dataset we can define the generalisation hierarchy based on 
the coverage of the hypotheses. Thus we do not depend on the type of information handled and we can compare hypotheses of 
different types. 
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SOHN algorithm 

Inserting a hypothesis h into a SOHN at its root hypothesis ho follows the following 
algorithm: 

insertfho, h) 

parents = findMaxSpecificParents(h 0/ h) //find the most specific parent hypotheses 

children = findMaxGenericChildren(h 0/ h) // find the most generic children hypotheses 

for each parent in parents 

children(parent) = children(parent) + h // add the new hypothesis as a child of the most specific 
parent 

for each parent in parents 

for each child in children 

children(parent) = children(parent) - child // remove redundant paths 

for each child in children 

children(h) = children(h) + child // add the new hypothesis as a parent of the most generic 

child 

findMaxSpecificParentsfh!, h 2 ) 

parents = {} 

if(h!> h 2 ) //if hi is more general than h2 recursively look for more specific parents 

for each child in children^!) 
parents = parents + findMaxSpecificParents(child, h 2 ) 

if (parent == {}) parents = { hj// if parents is empty we have found a most specific parent 
return parents 

findMaxGenericChildrenlhi, h 2 ) 

children = {} 

if(h 2 > children = { hj //ifwe have found a child stop 
else //else recursively look for generic children 
for each child in children^!) 
children = children + findMaxGenericChildren(child, h 2 ) 

return children 
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Figure 9 Self Organising Hypothesis Network. Using the 'is more general than' relationship a collection of hypotheses can be organised into a 
hierarchical network from the most generic (the root) to the most specific (the examples). 



An implementation of the algorithm will typically also 
allow making decision about inserting or not a hypoth- 
esis based on the information gain depending on the 
insertion point. 

Although the structure of a SOHN is very close to a 
Galois lattice used in Formal Concept Analysis [19] 
(FCA) and hypotheses play here a similar role to con- 
cepts in FCA, there are fundamental differences be- 
tween the two approaches. Whereas in FCA, concepts 
are derived from the examples, in the SOHN approach 
hypotheses are not necessarily inferred from the refer- 
ence dataset. For instance a hypothesis can be designed 
by a human expert. In the SOHN method, examples in 
the reference dataset are principally used to organise 
and support the hypotheses. Furthermore, in FCA the 
relation 'is more general than' is defined at the attri- 
bute level (and by extension, at the implied covered 
example level) and assumes that all the concepts work 
in the same predefined attribute space. In the SOHN 
approach the relation depends only on the example 
coverage and, it is therefore a less constrained method- 
ology. This allows us to combine different sources and 
types of hypotheses and subsequently project them on 
an independent reference dataset. In FCA the example 
set and the inferred concepts are interdependent by 
principle whereas in the SOHN approach example and 
hypotheses are inherently independent although they 
are often related in practice as a consequence of the 
knowledge extraction process. Nevertheless the SOHN 
method can also benefit from some of the FCA tech- 
niques when these differences are not critical, the 
exploration of such opportunities is not in the scope of 
this article. 

It is important to note that the reference dataset is 
required to establish the hypothesis hierarchy and that it 



becomes part of the final network. Different reference 
datasets may induce different hierarchies of hypotheses; 
this reflects the fact that the level of generalisation is not 
intrinsic to a hypothesis but depends on the chemical 
space in which the hypotheses are applied. Also, the 
reference dataset is not to be mistaken with a training 
dataset. 

Once the SOHN has been constructed for a collection 
of hypotheses and a given reference dataset, it is possible 
to analyse the signal and coverage of each hypothesis. In 
Figure 10 each example is colour coded; green for posi- 
tive examples e!-e 5 and red for negative examples e 6 -e 10 . 
The colours for the hypotheses correspond to the intensity 
of their signal. From a learning perspective we expect the 
hypotheses to have a good combination of strong signal 
and wide coverage. These two parameters are usually 
antagonistic and we expect the different sources of 
knowledge to optimise knowledge extraction and pro- 
vide "good" hypotheses during the learning phase. The 
quality of the knowledge contained in a SOHN will 
therefore depend on the learning performance of the 
individual hypotheses providers (not the SOHN meth- 
odology itself); additional knowledge may emerge from 
the combination and organisation of this knowledge 
within the SOHN as the result of a synergetic effect. 

The resulting SOHN and its associated information 
network is a computer friendly data structure and this 
knowledge structure can be efficiently used in the con- 
text of SAR analysis and for the prediction of unseen 
compounds. 

Results 

An example of SOHN for the mutagenicity endpoint 

This section illustrates a simple example of a SOHN 
network for hypotheses designed to capture knowledge 
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Figure 10 SOHN with activity information. When overlaying the signal of the hypotheses for a given endpoint, the SOHN becomes an 
interesting SAR analysis tool. 



about the toxicological mutagenicity endpoint. The 
knowledge was mined from a dataset of 8,201 struc- 
tures distributed over 2 classes, 50% mutagen, and 50% 
non mutagens. Only 2 sources of hypotheses were 
used. The first source is the dataset itself, providing 
the top level example hypotheses and the second is a 
machine learning algorithm, namely a Decision Tree, 
producing structural hypotheses. Decision Trees were 
chosen because they provide directly interpretable 
nodes which are easy to convert into hypotheses. For 
this purpose the 8,201 structures have been fragmen- 
ted into -20,000 fragments using an in-house algo- 
rithm based on a reduced feature graph [20,21]. Rather 
than to fragment the original molecule, our fragmenta- 
tion method first builds an intermediate reduced graph 
where all the nodes represent a structural unit of the 
original molecule. The scope of a structural unit is 
flexible and can be adjust to different use-cases (chemical 
reactivity, pharmacophoric abstraction, etc.). Structural 
units can for instance represent single atoms and bonds, 
functional groups, rings, fused rings, etc. Once the re- 
duce graph has been constructed we fragment this 
graph using a combination of circular and linear path 
enumeration. Each fragment generated from the re- 
duced graph is expanded back to a molecular fragment 
graph. The depth of the path enumeration can be con- 
figured. This fragmentation method allows us to take 
advantage of an exhaustive path enumeration without 
the risk of breaking logical units within the molecules 
(Figure 11). 

Finally, fragments present in at least 4 examples, were 
used as 2D structural descriptors to build a Decision 
Tree resulting in 400 nodes. Fragments retained as 
nodes in the Decision Tree represent good hypothesis 
candidates and all 400 nodes were converted into the cor- 
responding structural hypotheses. Note that the Decision 



Tree algorithm used ensures that each decision node 
corresponds to a more specific hypothesis than its par- 
ent allowing the Decision Tree hierarchy to be redis- 
covered during the SOHN building process. In order to 
optimise the knowledge extraction process hypotheses 
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Figure 1 1 Fragmentation methodology. 
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Figure 12 Example of workflow used to convert a dataset into a SOHN representation. Structures in the dataset are first fragmented. The 
knowledge is then mined using a recursive partitioning approach to identify information-rich fragments which are converted into corresponding 
structural hypotheses. The input dataset is also used as the reference example dataset to build the SOHN. 



can be further filtered against different criteria like 
information gain, minimum coverage, etc. Note that 
we are not interested in the Decision Tree as a final 
model, we used this machine learning technique only 
to identify relevant structural patterns that we even- 
tually capture in the form of SOHN hypotheses 
(Figure 12). 

Finally the 400 hypotheses have been organised into 
a SOHN following the methodology described earlier. 
At this point we are no longer dependent upon the 
source of the hypotheses i.e. we become agnostic of the 



learning method used. A simplified version of the result- 
ing SOHN is shown in Figure 13. 

SOHN as SAR analysis tool 

In the SOHN built in the previous section, the general 
hypotheses (first level above the root) with a positive 
signal (mutagenic activity) correspond to common struc- 
tural alerts for this endpoint. Indeed, moieties like aromatic 
nitro groups, aromatic amines, epoxides, and aziridines 
are known toxicophores. The presence of these expected 
patterns means that the Decision Tree algorithm, using 




Simplified mutagenicity SOHN 



Figure 13 Example of SOHN for the mutagenicity endpoint. A simplified view of the SOHN network build from 8,201 structures for the 
mutagenicity endpoint. In this representation, fragment hypotheses are open structures whereas example hypotheses are closed structures. 
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the fragment descriptors, succeeded in mining relevant 
knowledge from the dataset These hypotheses represent 
general rules covering a relatively large proportion of 
the dataset and are therefore key elements of know- 
ledge for the studied endpoint At a more specific level, 
hypothesis h 6 tells us that the mutagenic effect of the 
aromatic nitro group (h 2 ) can be mitigated by the pres- 
ence of a sulfonic acid group in the meta position. h 6 is 
a child of h 2 and therefore represents a refinement rule 
(more specific hypothesis). The signal contrast between h 2 
and h 6 clearly indicates an interesting activity change and 
the structural transition from h 2 to h 6 provides a useful 
explanation. 

More generally each path in the SOHN contains 
rich SAR information helpful in many contexts such 
as: 

• activity cliff detection [22] 

• matched molecular pairs identification [23] 

• knowledge discovery and refinement 

• lead optimisation 

The most specific hypothesis (the observed data 
points) provide underlying supporting evidence. For in- 
stance hypothesis h 6 is supported by several examples in 
the reference dataset among which e 6 . 

The SOHN structure provides a rich representation 
of the knowledge associated with a given endpoint and 
facilitates the SAR analysis. When graphically displayed, 
SOHN provide useful visual information about the 




Figure 15 Example of prediction analysis. To find the most relevant knowledge that will contribute to a prediction, the SOHN algorithm 
recursively searches the network for the most specific hypotheses that apply to the query compound, starting from the root. In the current case 
h 6 and h 7 have been identified as the most specific hypotheses (none of the examples matches the query). 




Figure 14 Information within the SOHN paths. Each path in the 
SOHN provides structured knowledge about the SAR that can be 
used to discover activity cliffs, suggest lead optimisation or 
interpret predictions. 
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SAR landscape of the reference dataset similarly to 
SAR trees or pathways [24] and Scaffold trees [25] 
visualisations. In the latter, the information is structure- 
centric whereas in SOHN it is hypothesis centric. Hy- 
potheses are organised in SAR paths going from generic 
key rules to factual evidence via intermediate hypotheses 
that capture important refinement factors. Each path from 
the root hypothesis to more specific hypotheses and even- 
tually to real examples describes a SAR route that can 
contribute to an explanation of the activity of these exam- 
ples (Figure 14). 

It is noteworthy that even though, in this case, the hy- 
potheses were generated using a Decision Tree, their or- 
ganisation inside the SOHN has a network structure and 
not a tree based structure. Indeed specific hypotheses 
may have several more generic parents and vice versa (as 
implied in Figures 9 and 13). 

SOHN as a predictive model 

When human experts try to predict the activity of 
an unseen compound (query compound) they usually 
analyse the chemical structure and search for parts 
of their knowledge that can be applied to the struc- 
ture. The expert builds up a collection of hypotheses 
about the activity of the query based on these se- 
lected elements of knowledge. When some elements 
are contradictory, the expert will weigh the retained 
hypotheses according to their relevance to the ana- 
lysed compound. 

The SOHN approach uses exactly the same method. 
The strongly hierarchical structure of the SOHN allows 
us to define a deterministic and optimised algorithm to 
explore all the possible hypotheses that can apply to a 
new unseen compound and to select the most relevant 
ones. This first step can be achieved using the following 
algorithm: 



The most relevant hypotheses for a query compound x 
can be found by calling the function for the root of the 
network: 

relevantHypotheses x = FindRelevantHypotheses(h 0 , x) . 

The algorithm recursively explores the SOHN network 
starting from its root and identifies the most specific 
hypotheses that apply to the query compound (first 
clause of the hypothesis contract) and for which the 
compound is in the applicability domain (second clause 
of the contract). The resulting hypotheses represent the 
most relevant knowledge for predicting the activity of 
the compound. Each hypothesis retained corresponds to 
the most local model that implements a partial contribution 
of the knowledge. Figure 15 describes a virtual example of 
SOHN exploration leading to two final hypotheses h 6 and 
h 7 . The ability for the model to consider several hypotheses 
simultaneously is a very interesting feature (often absent 
in other methods). Indeed molecules may contain several 
causes of activity or inactivity and it is important for a 
good prediction to take into account all these factors. The 
identification of all the relevant effects is also valuable in- 
formation in a decision process and will help the expert 
assess the prediction based on these factors. 

The selected hypotheses and the corresponding paths 
from the root provide the necessary knowledge to build 
an informed decision about the class of the query. This 
knowledge is composed of hypotheses and factual evi- 
dence which can be used in different ways to construct a 
final overall prediction. To this end, we have developed 
a prediction methodology that takes advantage of all the 
available information in the identified hypotheses to 
build an accurate prediction along with a confidence 
level and supporting examples (Figure 16). This method 
is divided into two main steps: 



FindRelevcmtHypothesesfh, x) 

hypotheses = 0 

ifx eScope(h) then 
for each child of h 

hypotheses = hypotheses U FindRelevantHypothesesfchild, x) 

if hypotheses = 0 and x (EDomain(h) then hypotheses ={h} 
return hypotheses 
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Relevant Local calls 

hypotheses 

Figure 16 Combining local hypotheses prediction into a global prediction. Local prediction based on individual hypotheses are 
combined into an overall call based on a reasoning method that can be adapted to the use case (screening, lead optimisation, risk 
assessment, etc.). 



1. Construct individual hypothesis predictions which will 
contribute to the overall call. In this step each 
hypothesis will predict a class for the query compound. 
Each hypothesis has a confidence level and a set of 
supporting examples attached to its local' prediction. 

2. Combine the hypotheses prediction into a final overall 
call by taking into account the individual hypothesis 
predictions and their respective confidence levels. 
Different ways to combine these hypotheses can be 
considered depending on the desired use case. 

Individual hypothesis prediction (local prediction) 

The path from the root to the hypothesis is used as sup- 
porting knowledge to explain the predictions outcome 
whereas the actual prediction is built from the reference 
dataset examples covered by the hypothesis using a kNN 
[26] algorithm (Figure 17). 



This instance-based prediction takes into account the 
local SAR landscape' (distribution of classes within the 
supporting examples) and identify situations where the 
query compound might occupy a region of the chemical 
space which is not in agreement with the hypothesis. In 
Figure 18, the query compound X\ will be predicted 
positive with high confidence, and x 2 will be positive 
with less confidence than for X\. The query x 3 will be 
predicted negative in contradiction with the hypothesis; in 
this case the hypothesis is said to be overruled and can be 
ignored. 

To make a local prediction from a hypothesis, first the 
k (k =10) nearest neighbours to the query are identified 
amongst the supporting examples of the hypothesis. Then 
a weighted prediction signal is computed using the 
similarity of the examples with the query compound as 
the weighting factor. The square root of the similarity has 




Supporting examples 



Explore 
SOHN 



Unseen instance 




Prediction (instance based) 



r Interpretation (induction) 



Most relevant part 
of the knowledge 

Figure 17 Prediction and interpretation analysis. Each hypothesis contributing to the prediction provides information to support the actual 
prediction (the supporting examples) and interpret the outcome of this prediction (the path from the root to the hypothesis); having simple and 
interpretable hypotheses in the first place will contribute to the transparency of the model. 
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chemical space 
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Figure 18 Accounting for the SAR landscape in the prediction confidence. The local prediction for an individual hypothesis uses a kNN 
model based on the supporting examples. This allows accounting for the SAR landscape in the corresponding region of the chemical space. In 
this illustration, the query x ] will be predicted positive with high confidence, x 2 positive with less confidence and x 3 will be predicted negative 
overruling the hypothesis which can be ignored. 




been identified as a good weighting using cross validation 
and the following measure is used: 

(i) 

(2) 



Vi iX = yj Similarity (x, e*) 



WSh x = 



Where: 



E^> 



w i>x is the weight assigned to the nearest neighbour e x 
given the query compound x. 

Similarity '(x^J is the similarity index between the query 
x and the nearest neighbour e t (using the third clause 
of the hypothesis contract). 
Si is the instance signal (its class) of the nearest 
neighbour e t . = -1 if e^ is a negative example; s t = +1 
if e t is a positive example. For the sake of simplicity, the 
scope of this article will only address binary 
classification. However the methodology supports 
continuous values for the signal and more generally, 
multiple classes using signals distributed over more 
than 2 classes. Regression problems can also be 
considered using a similar formalism. 

Secondly we further moderate the weighted signal to 
account for the average distance of the nearest neighbours 
to the query compound. The final prediction signal for the 
hypothesis h can be written: 



Sh,x 



Thus replacing ws h)X from Equation 2 



Sh,x — " 



(3) 



(4) 



The sign of the final signal S h)X is used to classify the 
compound x according to the hypothesis h and the 



absolute value of the signal is used to measure the confi- 
dence in the prediction. The closer to +1 or -1 the more 
confident we are (according to the current hypothesis) 
that the compound is positive or negative respectively and 
the closer to 0 the lower the level of confidence attached 
to the local prediction. A signal of 0 is seen as equivocal. 

{positive if s^ x > 0 
negative if Sh^x < 0 
equivocal if s^x — 0 

confidences^ = \ s h,x\ 

The confidence ranges from 0 (equivocal) to 1 very 
confident (or certain in the case of an exact match when 
the hypothesis is an example). 

This methodology is general and valid for all the hy- 
potheses including extreme situations: 

In the case of an exact match between the query and 
an example, the hypothesis search results in a single 
hypothesis (the example itself) and the prediction 
reflects the factual data: 

if3e,e = x then class^x is class(e); confidence^ is 1 

Similarly, at the opposite extreme, there is always at least 
one hypothesis that applies to the query compound 
which is the root of the SOHN (which matches the 
whole chemical space). If no other hypothesis is found 
during the SOHN exploration the kNN is applied to the 
supporting examples of the root (h 0 ) which corresponds 
to the whole reference dataset. 

Overall call (global prediction) 

Once a prediction and a confidence level have been calcu- 
lated for each of the relevant hypotheses we can construct 
an overall call based on these values. Different reasoning 
heuristics can be used depending on the use case. Our 
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default reasoning algorithm simply weights each hypothesis 
according to its confidence level Additionally two parame- 
ters 'a and 'b' are introduced to control the signal threshold 
for equivocal prediction (a) and the balances between sen- 
sitivity and specificity (b). The overall signal s x for the 
query compound can be calculated as the follows: 

^^7-1 confidence^ 

( positive if s x > a + b 
class x = < negative if s x < a-b 

I equivocal if a-b<s x <a + b 

confidence x — \s x \ 
Where: 

• m is the number of relevant hypotheses 

• a is the signal threshold for separating two classes 
(in a binary classification problem). The default 
value for a is 0 

• b is a minimum level the signal must reach to 
become significant (otherwise it is considered as 
equivocal). The default value for b is 0. 

The number of hypotheses used for a given prediction 
should not be too important in order to maintain the 
interpretability of the model Figure 19 shows the distri- 
bution of the number of hypotheses found per query 
during the 5-fold cross-validation experiment using the 
previously described mutagenicity dataset. From the figure 
we can conclude that the most common case is a predic- 
tion based on 2 hypotheses. A closer analysis reveals that 



Hypothesis count distribution 

3000 
2500 



r 2000 




0 1 234567B9 10 More 
U of hypotheses 



Figure 19 Number of hypotheses used per prediction. The figure 
presents a distribution of the hypothesis count per prediction during a 
5 fold cross-validation experiment using the previous mutagenicity 
dataset (8201 structures). The most common case is a prediction based 
on 2 hypotheses and 90% of the predictions were based on 4 or 
less hypotheses. 

V / 



90% of the compounds were predicted using 4 or fewer 
hypotheses which favours an easy interpretation. 

Prediction confidence 

The confidence metric was designed to assist the end user 
in a decision process and to facilitate the comparison of 
different models with the assumption that in each case 
there is a comparable correlation between confidence and 
accuracy. Ideally a confidence metric would have a linear 
correlation between the individual prediction accuracy 
estimate and the observed accuracy. The behaviour of this 
confidence measure has been assessed by plotting the 
prediction confidence value against the observed accuracy 
during a 5-fold cross-validation using the previously de- 
scribed dataset. The prediction results were merged and 
split into 5 equidistant confidence bins for which the 
corresponding accuracy was plotted (Figure 20, left). As 
expected the predictions with confidence towards 0 have 
accuracy close to a random predictor (close to 50%) and 
prediction with a high confidence towards 1 approach the 
performance of a perfect model (close to 100%). 

Although the correlation between confidence and accur- 
acy is positive and monotonic, it is not linear and in order 
to provide a more intuitive confidence measure (linearly 
correlated to the accuracy) the models confidence value is 
refitted using an order 3 polynomial transformation T. This 
transformation provides the final normalized confidence 
called standard confidence (Figure 20, right) 

standard confidence x = T (raw confidence^) 

The exact coefficients of this fit function T are dependent 
on the SOHN model and are evaluated using the confi- 
dence vs. accuracy data measured during cross validation. 

The final overall prediction provides the following set 
of information: 

• class x : predicted class 

• standard confidence^, normalised confidence level 
for the prediction 

• {h}: Set of hypotheses relevant to the query compound 
and which have been used in the overall call. 

And for each hypothesis: 

• Class X; h: The local class prediction for this hypothesis 

• confidence^, the confidence associated with the hypothesis 

• {e}\ a set of examples supporting the hypothesis 

As we can see the SOHN prediction methodology 
provides rich information for the end user to facilitate 
the assessment of the models conclusion. The expert 
can use the explanation inferred by the set of relevant 
hypothesis, integrate the confidence metric to establish 
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Figure 20 Raw confidence vs. Standardised confidence. Left, the raw confidence estimate and the actual observed accuracy correlate well 
but in a non linear way. Right, the standard confidence models has a more linear correlation with the observed accuracy after applying a 
polynomial fit and provides an intuitive and normalized estimate of the accuracy for each individual prediction. 



a trust level and use the supporting examples (and 
their similarity with the query compound) to further 
refine the assessment. Finally using their own know- 
ledge the experts can reject or confirm the models 
conclusion on the basis of a transparent prediction. 
Transparency was a major driver when developing this 
method. 

Applicability domain 

When selecting relevant hypotheses the query compound 
has to be in the applicability domain of the retained 
hypotheses. Each type of hypotheses defines its own 
applicability domain (second clause of the contract); the 
domain should depend upon the type of information 
captured in this type of hypotheses. For instance, in our 
example of structural hypotheses, the applicability domain 
was defined based upon the dictionary of fragments con- 
structed when fragmenting the reference dataset. In order 
to be inside the domain of applicability all parts of the 
query compounds must be covered by at least one non 
rare' fragment of the dictionary where non rare' means 
fragments occurring in at least 4 examples in the reference 
dataset. 

The applicability domain is therefore delegated to the 
hypothesis types and this way we decouple the overall 
applicability domain management from the prediction 
algorithm. New types of hypotheses can be easily inte- 
grated in the methodology without having to revisit the 
prediction mechanism. If the applicability domain de- 
pends on a reference collection of structures it must be 
defined using the SOHN reference dataset since this is 
the chemical space used for the kNN prediction. 

Comparison with other methods 

The SOHN methodology applied to the mutagenicity 
endpoint was compared to the Support Vector Machine 



(SVM), Random Forest (RF), k Nearest Neighbours (kNN) 
and Decision tree (DT) learning methods. For the hypoth- 
eses generation a recursive partitioning method (actually 
a simplified version of a decision tree as described in 
Figure 12) was used to identify fragments that yield the 
maximum information gain [27]. This experiment does 
not intend to directly compare quantitative performance 
since the actual knowledge extraction algorithm is a 
simplified decision tree for which we do not expect 
highly optimized performance compared to robust SVM 
and RF models. The goal is to verify that we can indeed 
decouple the learning method from the prediction algo- 
rithm using the SOHN approach without a significant 
loss of performance relative to the knowledge mining 
technique. Additionally we would like to observe the be- 
haviour of the SOHN prediction algorithm in the case 
of a different chemical space compared to the other 
learning methods. 

A dataset of public mutagenicity data was extracted 
from the Vitic database [28]. After curation the dataset 
contained 8,201 compounds divided into 4,152 mutagens 
(positives) and 4,049 non mutagens (negatives). In parallel, 
5 times cross validations were performed using SOHN, 
SVN, RF, kNN and DT models for the same dataset and 



Table 1 Comparison of models for the public dataset 

Cross validation using the public dataset (8201 structures) 



Fragments 


BAC 


SEN 


SPEC 


PPV 


NPV 


SOHN 


0.78 


0.78 


0.78 


0.79 


0.78 


Random Forest 


0.79 


0.83 


0.76 


0.78 


0.76 


SVM 


0.81 


0.82 


0.81 


0.82 


0.81 


kNN 


0.76 


0.73 


0.80 


0.79 


0.80 


Decision Tree 


0.77 


0.76 


0.79 


0.79 


0.79 



BAC = balanced accuracy, SEN = sensitivity, SPEC = specificity, PPV = positive 
predictivity, NPV = negative predictivity. 
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SOHN 
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Figure 21 Comparison of models for the public dataset. The 

performance of the SOHN model (using recursive partitioning as the 
knowledge mining algorithm) is comparable to other mainstream 
machine learning methods. 



using the fragments generated in the fragmentation 
process as descriptors in order to use the same feature 
space. 

Table 1 and Figure 21 show the results for the differ- 
ent models. The best model is obtained using the SVM 
algorithm followed by the Random Forest model. The 
SOHN model ranks third. Interestingly the Decision 
Tree and the kNN algorithms do not outperform the 
SOHN model which means that combining the Decision 
Tree approach for mining hypotheses and the SOHN/kNN 
method for predicting does not induce a loss of predictive 
performance. Although this is mainly a qualitative experi- 
ment, it is encouraging to see that the SOHNs perform- 
ance is comparable to other techniques and in addition 
provides transparent predictions. 

A second experiment was run to analyse the behaviour 
of the SOHN model using a chemical space different 
from the training set. It is well known that toxicity pre- 
diction models trained on public data usually perform 
less well on proprietary data due to differences in chemical 
space coverage for reason of confidentiality [29]. Thanks 
to a data sharing initiative with several pharmaceutical 
companies we have had access to a dataset of intermedi- 
ates used during the process of drug development. The 

Table 2 Comparison of models for the confidential 
dataset 



Balanced accuracy 



Decision Tree 




SOHN 

■ Random Forest 

■ SVM 

■ KNN 

■ Decision Tree 



0.50 0.60 0.70 0.80 0.90 1.00 

Figure 22 Comparison of models for the confidential dataset. 

The SOHN model's performance is comparable to other methods 
and seems to be slightly more robust when applied to the 
confidential chemical space. 



chemical space represented by these compounds is spe- 
cific to the pharmaceutical industry and it was interesting 
to analyse how different models project from the public 
training space onto this confidential test space. A model 
able to transpose from one chemical space to the other is 
especially valuable in the context of risk assessment. 

All 5 models were trained with the whole public data- 
set (8201) structures and the confidential intermediates 
dataset was used as a test set. The intermediates dataset 
contains 800 compounds of which only 30% are mutagens. 
It is a chemical space biased toward negative instances 
which adds difficulty to the prediction task. The same 
protocol as for the previous experiment was used. Table 2 
and Figure 22 presents the results from this experiment. It 
can be seen that, as expected, all models perform less 
well on this challenging confidential chemical space. The 
SOHN model seems to be slightly more robust providing 
the best results in each category. 

Finally, to complete this comparison, a second external 
validation was performed using a mutagenicity dataset 
provided by the Center for Food Safety and Applied 
Nutrition (CFSAN) [30]. Again, all 5 models were trained 
with the whole public dataset (8201) structures. The 
CFSAN dataset represents a chemical space used in a 
regulatory context and contains 1399 structure not du- 
plicated in the training dataset. Like the intermediates 

Table 3 Comparison of models for the CFSAN dataset 



External validation using the confidential dataset (800 structures) External validation using the FDA/CFSAN dataset (1399 structures) 





BAC 


SEN 


SPEC 


PPV 


NPV 




BAC 


SEN 


SPEC 


PPV 


NPV 


SOHN 


0.67 


0.53 


0.80 


0.55 


0.80 


SOHN 


0.75 


0.64 


0.86 


0.75 


0.79 


Random Forest 


0.64 


0.50 


0.77 


0.50 


0.77 


Random Forest 


0.77 


0.74 


0.80 


0.70 


0.83 


SVM 


0.64 


0.50 


0.77 


0.50 


0.77 


SVM 


0.78 


0.70 


0.85 


0.74 


0.82 


kNN 


0.63 


0.48 


0.79 


0.51 


0.79 


kNN 


0.73 


0.62 


0.83 


0.70 


0.78 


Decision Tree 


0.62 


0.45 


0.80 


0.50 


0.80 


Decision Tree 


0.74 


0.62 


0.87 


0.74 


0.78 



BAC = balanced accuracy, SEN = sensitivity, SPEC = specificity, PPV = positive 
predictivity, NPV = negative predictivity. 



Results for the confidential dataset. BAC = balanced accuracy, SEN = sensi 
SPEC = specificity, PPV = positive predictivity, NPV = negative predictivity. 
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Figure 23 Comparison of models for the CFSAN dataset. The 

SOHN model's performance is comparable to other methods on 
this dataset. 



dataset, the CFSAN chemical space is strongly biased 
toward negative instances (874 negative instances for 525 
positive instances). Table 3 and Figure 23 indicate com- 
parable performances between the different models. 

Discussion 

The approach has successfully decoupled the learning 
phase from the prediction phase at the knowledge level 
by representing the knowledge in the form of hypotheses. 
While hypotheses were discovered using a recursive parti- 
tioning method, the corresponding Decision Tree was not 
used to perform the prediction. This separation did not 
lead to a loss of performance and actually even slightly 
improved it. 

The SOHN approach defines a consistent modular 
learning and prediction framework where learning methods 
can be interchanged or combined and where higher pre- 
diction or data analysis algorithms can be developed 
(Figure 24). 

From the user s viewpoint the SOHN approach results 
in a very transparent tool for knowledge discovery and 
prediction purposes. Early performance tests show that 
the approach provides results comparable to established 
methods even when using only 2 related sources of hy- 
potheses (training examples and a simplified recursive 
partitioning to generate hypotheses from these examples). 
These results are very encouraging given the early stage of 



the development of this approach. The next step will be to 
combine different sources of knowledge (e.g. predefined 
set of structural static patterns like MACCS keys (Molecu- 
lar ACCess System) or patterns mined by human experts) 
and merge the resulting knowledge into a single SOHN. 
From the prediction perspective we expect improved per- 
formance and from the SAR analysis viewpoint it will be 
interesting to visualise how the expert hypotheses are 
blended into the machine learned knowledge. 

Since the SOHN facilitates the combination of different 
learning methods it also provides a mechanism to build 
an ensemble of models at the knowledge level. Combining 
different knowledge sources into a single SOHN is a form 
of stacking [31], in this case "Knowledge stacking". Simi- 
larly other ensemble techniques can be applied like 
bagging [32] and boosting [33] to further improve the 
knowledge extraction process. In this case each bag- 
ging or boosting iteration contributes independently to 
the SOHN. The resulting knowledge is automatically 
merged and organised into a single common SOHN 
for which we can apply the same prediction and SAR 
analysis tools, allowing the end user to access an opti- 
mised model for both predictive and knowledge discovery 
purposes (Figure 25). 

Another key feature of the SOHN approach is to sup- 
port different types of hypotheses. For the mutagenic- 
ity endpoint, only structural hypothesis were used; in 
future work, other important hypothesis types like physico- 
chemical or pharmacophoric hypotheses will be added to 
show how different kinds of knowledge units can be used 
by the prediction algorithm to build transparent and ele- 
gant conclusions. For instance we expect predictions in 
the form of: 

"This compound is active because it contains this 
moiety (structural hypothesis) but the confidence is low 
owing to a high logP value (physico-chemical 
hypothesis). " 

The SOHN framework allows this kind of inference 
even though both hypotheses may have been originated 
from different learning approaches. 



Learning 
Algorithm 1 



Learning 
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Figure 24 Overview of the SOHN approach. 
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Figure 25 Applying ensemble modelling to the SOHN. Different learning methods and different iterations of bagging and boosting 
optimisation can contribute to the same SOHN in order to construct an extensive knowledge asset that can be used for accurate and transparent 
prediction or detailed SAR analysis. 



Finally, the formalisation and unification of the know- 
ledge opens up the potential for combining the hypotheses 
within the SOHN to develop optimisation or repurposing 
algorithms independently. For instance, using the formal 
contract of hypotheses we can design operators to com- 
bine hypotheses regardless of their type (Figure 26). It is 
possible to develop algorithms to identify hypothesis asso- 
ciations that further improve the predictive performance 
of the SOHN and hence extend the learning process at 
the SOHN level (meta-learning). In this context SOHN 
can benefit from the techniques used in formal concept 
analysis. This powerful aspect of the SOHN will also be 
investigated in future work. 

Conclusion 

We have presented a new approach for representing and 
organising knowledge in the form of Self Organising Hy- 
pothesis Networks. This method decouples the learning 
phase from the application phase by introducing an 
intermediate knowledge layer based on the concept of 
hypothesis. A hypothesis represents a simple and inter- 
pretable unit of knowledge which may describe any rele- 
vant aspect of a chemical compound in the context of 
an endpoint. Decoupling the learning and the applica- 
tion framework allows us to combine different sources 
of knowledge into the same formal framework and to 
apply mutualised algorithms. By introducing the concept 
of hypothesis we define a higher level model that is inde- 
pendent of the learning technique and operates at the 
early knowledge level rather than at the later prediction 



N 

1 < logP < 4 




1 < logP < 4 



ru h b 

Figure 26 Combining hypotheses. The abstraction of the 
hypotheses permits new ways of combining knowledge units to be 
defined. Creating hypothesis combination operators is possible; in 
this figure hypothesis h a and h b are combined using a conjunction 
operator to produce h ab . If h ab is a valuable unit of knowledge (e.g. 
provides enough information gain) then it can be kept and inserted 
in the knowledge. This is a form of meta-learning. 
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stage as is the case for consensus models. We have pre- 
sented a simple application to the mutagenicity endpoint 
and shown that the SOHN models performances are 
comparable to other common type of models while 
showing improved robustness when testing with external 
validation sets. The predictions are transparent as the 
result of the interpretable nature of the hypotheses. The 
proposed SOHN prediction algorithm assigns a confidence 
level for each individual prediction; this accuracy estimate 
correlates well with that observed. The hierarchical 
structure of the SOHN also facilitates the identification 
of interesting SAR patterns and activity cliffs; it can be 
used as a powerful assistant in knowledge discovery and 
lead optimisation. 

The SOHN methodology clearly offers further poten- 
tial in the context of SAR analysis and for building 
accurate and transparent predictive modules. This article 
covers the general principle of this new paradigm and 
describes a formal framework to manage hypotheses and 
their application. Although initially focused on binary 
classification tasks, the underlying formal methodology 
should equally apply to multi class classification and 
regression problems. Upcoming research will explore the 
potential of combining different sources of knowledge, 
mixing different types of hypotheses and applying bag- 
ging and boosting optimisation techniques. Future work 
will also investigate the ability to extend the learning 
process inside the SOHN as a form of meta- learning. 
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